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Abstract 


Recently there has been substantial interest in spectral methods for learning dy¬ 
namical systems. These methods are popular since they often offer a good tradeoff 
between computational and statistical efficiency. Unfortunately, they can be dif¬ 
ficult to use and extend in practice: e.g., they can make it difficult to incorporate 
prior information such as sparsity or structure. To address this problem, we present 
a new view of dynamical system learning: we show how to learn dynamical sys¬ 
tems by solving a sequence of ordinary supervised learning problems, thereby 
allowing users to incorporate prior knowledge via standard techniques such as 
L\ regularization. Many existing spectral methods are special cases of this new 
framework, using linear regression as the supervised learner. We demonstrate the 
effectiveness of our framework by showing examples where nonlinear regression 
or lasso let us learn better state representations than plain linear regression does; 
the correctness of these instances follows directly from our general analysis. 


1 Introduction 


Likelihood-based approaches to learning dynamical systems, such as EM jT| and MCMC |[2|, can 
be slow and suffer from local optima. This difficulty has resulted in the development of so-called 
“spectral algorithms” G), which rely on factorization of a matrix of observable moments; these 
algorithms are often fast, simple, and globally optimal. 

Despite these advantages, spectral algorithms fall short in one important aspect compared to EM and 
MCMC: the latter two methods are meta-algorithms or frameworks that offer a clear template for 
developing new instances incorporating various forms of prior knowledge. For spectral algorithms, 
by contrast, there is no clear template to go from a set of probabilistic assumptions to an algorithm. 
In fact, researchers often relax model assumptions to make the algorithm design process easier, 
potentially discarding valuable information in the process. 

To address this problem, we propose a new framework for dynamical system learning, using the 
idea of instrumental-variable regression IS El to transform dynamical system learning to a sequence 
of ordinary supervised learning problems. This transformation allows us to apply the rich literature 
on supervised learning to incorporate many types of prior knowledge. Our new methods subsume a 
variety of existing spectral algorithms as special cases. 

The remainder of this paper is organized as follows: first we formulate the new learning framework 
CSec. |2]». We then provide theoretical guarantees for the proposed methods (Sec. [4]). Finally, we give 
two examples of how our techniques let us rapidly design new and useful dynamical system learning 
methods by encoding modeling assumptions CSec. [5]). 
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S1A regression -$E[q t \h t ] 




Condition on o t (filter) q t+1 
Marginalize o t (predict) qt+i\t-i 


Figure 1: A latent-state dynamical system. 
Observation ot is determined by latent state 
St and noise e t . 


Figure 2: Learning and applying a dynami¬ 
cal system with instrumental regression. The 
predictions from S1 provide training data to 
S2. At test time, we filter or predict using the 
weights from S2. 


2 A framework for spectral algorithms 

A dynamical system is a stochastic process (i.e., a distribution over sequences of observations) such 
that, at any time, the distribution of future observations is fully determined by a vector s f called the 
latent state. The process is specified by three distributions: the initial state distribution -P(si), the 
state transition distribution P(s t +i \ s t ), and the observation distribution P(o t | s t ). For later use, 
we write the observation o t as a function of the state St and random noise e t , as shown in Figure [l] 

Given a dynamical system, one of the fundamental tasks is to perform inference, where we predict 
future observations given a history of observations. Typically this is accomplished by maintaining 
a distribution or belief over states b t \ t -\ = P(s t \ oi.t-i) where denotes the first t — 1 

observations. b t \t-i represents both our knowledge and our uncertainty about the true state of the 
system. Two core inference tasks are filtering and prediction Q In filtering, given the current belief 
bt = b t \ t _i and a new observation ot, we calculate an updated belief bt+i = b t+ i| t that incorporates 
Ot. In prediction, we project our belief into the future: given a belief b t \t~i we estimate b t +k\t-i = 
P{s t+k | for some fc > 0 (without incorporating any intervening observations). 

The typical approach for learning a dynamical system is to explicitly learn the initial, transition, and 
observation distributions by maximum likelihood. Spectral algorithms offer an alternate approach 
to learning: they instead use the method of moments to set up a system of equations that can be 
solved in closed form to recover estimates of the desired parameters. In this process, they typically 
factorize a matrix or tensor of observed moments—hence the name “spectral.” 

Spectral algorithms often (but not always (6)) avoid explicitly estimating the latent state or the initial, 
transition, or observation distributions; instead they recover obserx’able operators that can be used 
to perform filtering and prediction directly. To do so, they use an observable representation: instead 
of maintaining a belief b t over states St, they maintain the expected value of a sufficient statistic of 
future observations. Such a representation is often called a (transformed) predictive state f7). 

In more detail, we define q t = qt\t -i = | oi ; t-i], where ijj t = ip(op.t+k- 1 ) is a vector of future 

features. The features are chosen such that q t determines the distribution of future observations 
P(op.t+k- 1 I Filtering then becomes the process of mapping a predictive state q t to q t+ 1 

conditioned on o t , while prediction maps a predictive state q t = qt\t-i to q t +k\t-i = E[V ; i+fc \ 
oi-t-i] without intervening observations. 


1 There are other forms of inference in addition to filtering and prediction, such as smoothing and likelihood 
evaluation, but they are outside the scope of this paper. 

2 For convenience we assume that the system is fc-observable: that is, the distribution of all future obser¬ 
vations is determined by the distribution of the next k observations. (Note: not by the next k observations 
themselves.) At the cost of additional notation, this restriction could easily be lifted. 
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A typical way to derive a spectral method is to select a set of moments involving work out the 
expected values of these moments in terms of the observable operators, then invert this relationship 
to get an equation for the observable operators in terms of the moments. We can then plug in an 
empirical estimate of the moments to compute estimates of the observable operators. 

While effective, this approach can be statistically inefficient (the goal of being able to solve for the 
observable operators is in conflict with the goal of maximizing statistical efficiency) and can make 
it difficult to incorporate prior information (each new source of information leads to new moments 
and a different and possibly harder set of equations to solve). To address these problems, we show 
that we can instead learn the observable operators by solving three supervised learning problems. 

The main idea is that, just as we can represent a belief about a latent state s t as the conditional 
expectation of a vector of observable statistics, we can also represent any other distributions needed 
for prediction and filtering via their own vectors of observable statistics. Given such a representation, 
we can learn to filter and predict by learning how to map these vectors to one another. 

In particular, the key intermediate quantity for filtering is the “extended and marginalized” belief 
P(ot-,St+ 1 | oi : t-i)—or equivalently P(opt+k | oi:t-i)- We represent this distribution via a vector 
G = £(ot-.t+k) of features of the extended future. The features are chosen such that the extended 
state p t = E[G | o 1:t „i] determines P(o t:t+k | Oi :t _i). Given P{o t:t+k \ filtering and 

prediction reduce respectively to conditioning on and marginalizing over o f . 

In many models (including Hidden Markov Models (HMMs) and Kalman filters), the extended state 
p t is linearly related to the predictive state q t —a property we exploit for our framework. That is, 

Pt = Wqt (1) 

for some linear operator W. For example, in a discrete system ft can be an indicator vector repre¬ 
senting the joint assignment of the next k observations, and G can be an indicator vector for the next 
k + 1 observations. The matrix W is then the conditional probability table P(opt+k I Of.t+k- 1 ). 

Our goal, therefore, is to learn this mapping W. Naively, we might try to use linear regression for 
this purpose, substituting samples of and G in place of q t and p t since we cannot observe q t or 
p t directly. Unfortunately, due to the overlap between observation windows, the noise terms on f t 
and G are correlated. So, naive linear regression will give a biased estimate of W. 

To counteract this bias, we employ instrumental regression EE). Instrumental regression uses in¬ 
strumental variables that are correlated with the input q t but not with the noise ept+k- This property 
provides a criterion to denoise the inputs and outputs of the original regression problem: we remove 
that part of the input/output that is not correlated with the instrumental variables. In our case, since 
past observations oi-t-i do not overlap with future or extended future windows, they are not cor¬ 
related with the noise ep.t+k+ 1 , as can be seen in Figure [T] Therefore, we can use history features 
h t = h(oi-.t-i) as instrumental variables. 

In more detail, by taking the expectation of 0 given h t , we obtain an instrument-based moment 
condition: for all t, 

E[p t I ht] = E [Wqt I ht] 

E[E[G | o 1:t _ r] | ht] = WE[E[f t | oi-.t-i] \ h t ] 

E[G | ht] = WE[f t | h t ] (2) 

Assuming that there are enough independent dimensions in ht that are correlated with q t , we main¬ 
tain the rank of the moment condition when moving from 0 to 0, and we can recover W by least 
squares regression if we can compute E[f t \ h t \ and E[G | ht] for sufficiently many examples t. 

Fortunately, conditional expectations such as E[ij; t \ h t ] are exactly what supervised learning algo¬ 
rithms are designed to compute. So, we arrive at our learning framework: we first use supervised 
learning to estimate E[f t \ ht] and E[G | ht], effectively denoising the training examples, and then 
use these estimates to compute W by finding the least squares solution to 0. 

In summary, learning and inference of a dynamical system through instrumental regression can be 
described as follows: 

• Model Specification: Pick features of history h t = h{o\-t-f), future ip t = f>{op.t+k- 1 ) 
and extended future G = £( Opt+k )■ ipt must be a sufficient statistic for | 

G must satisfy 
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Model/Algorithm 

future features ip t 

extended future features 

it 

f filter 

Spectral Algorithm 
for HMM 0 

U 1 e 0± where e 0± is an indicator vec¬ 

tor and U spans the range of q t (typi¬ 
cally the top m left singular vectors of 
the joint probability table P(ot+i , ot)) 

U 1 e„ t+1 ® e ot 

Estimate a state normalizer from SI A 
output states. 

SSID for Kalman 
filters (time depen¬ 
dent gain) 

xt and xt 0 xt, where xt = 
U T Of.t+k — l for a matrix U that spans 
the range of qt (typically the top m left 
singular vectors of the covariance matrix 

Cov(e>t:t + fc l, Ot-k:t-l)) 

y t and y t ® yt. where 
yt is formed by stacking 
U T ot+i-.t+k and o t . 

Pt specifies a Gaussian distribution 
where conditioning on ot is straightfor¬ 
ward. 

SSID for stable 
Kalman filters 

(constant gain) 

U 1 Of.t+k — l (U obtained as above) 

o t and U 1 o t+1:t+fc 

Estimate steady-state covariance by 
solving Riccati equation (8). p t to¬ 
gether with the steady-state covariance 
specify a Gaussian distribution where 
conditioning on ot is straightforward. 

Uncontrolled HSE- 
PSR|9j 

Evaluation functional k s (ot-.t+k -1 ? •) 

for a characteristic kernel k s 

fe 0 (o t) .) ® k 0 (o t ,.) 

and ® k 0 {o t , ■) 

Kernel Bayes rule |10|. 


Table 1: Examples of existing spectral algorithms reformulated as two-stage instrument regression 
with linear S1 regression. Here o tl :t2 is a vector formed by stacking observations o t l through Of , and 
(gi denotes the outer product. Details and derivations can be found in the supplementary material. 


- E[ 0 t+ 1 I Oi:t_i] = /predict(E[6 | Oi :t _i]) for a known function / pre diet- 

- E[V>t+i I 0 i:f] = /filter(E[£t | oi :t _i],o t ) for a known function f mtei . 

• S1A (Stage 1A) Regression: Learn a (possibly non-linear) regression model to estimate 
iJi t = E [ip t | h t ]. The training data for this model are ( ht, if>t) across time steps 

• SIB Regression: Learn a (possibly non-linear) regression model to estimate = E[£ t | 
h t \. The training data for this model are (ht, Ct) across time steps t. 

• S2 Regression: Use the feature expectations estimated in SI A and SIB to train a model 
to predict = Wipt, where IT' is a linear operator. The training data for this model are 
estimates of (?/ t , £/) obtained from S1A and SIB across time steps t. 

• Initial State Estimation: Estimate an initial state <71 = Hi [Ti ] by averaging ipi across 
several example realizations of our time series]^] 

• Inference: Starting from the initial state q\, we can maintain the predictive state q t = 
E [ip t | o\,t~\] through filtering: given q t we compute^ = E[£ t | o\-.t-i\ = W qt . Then, 
given the observation ot, we can compute q t +\ = /filter {'Pt ■ Ot)- Or, in the absence of o t , 
we can predict the next state q t +\\t-i = /predict (pt)- Linally, by definition, the predictive 
state q t is sufficient to compute P(ot-t+k-i \ oi : t_i)j^| 

The process of learning and inference is depicted in Ligure[2] Modeling assumptions are reflected 
in the choice of the statistics ip, £ and h as well as the regression models in stages SI A and SIB. 
Table[l]demonstrates that we can recover existing spectral algorithms for dynamical system learning 
using linear SI regression. In addition to providing a unifying view of some successful learning 
algorithms, the new framework also paves the way for extending these algorithms in a theoretically 
justified manner, as we demonstrate in the experiments below. 


3 Related Work 

This work extends predictive state learning algorithms for dynamical systems, which include spec¬ 
tral algorithms for Kalman filters eh, Hidden Markov Models Ema, Predictive State Represen¬ 
tations (PSRs) 11 3 13 and Weighted Automata lfl5l . It also extends kernel variants such as 
which builds on US). All of the above work effectively uses linear regression or linear ridge regres¬ 
sion (although not always in an obvious way). 


3 Our bounds assume that the training time steps t are sufficiently spaced for the underlying process to mix, 
but in practice, the error will only get smaller if we consider all time steps t. 

4 Assuming ergodicity, we can set the initial state to be the empirical average vector of future features in a 
single long sequence, / V’t- 

3 It might seem reasonable to learn qt+i = /combined (qt, ot) directly, thereby avoiding the need to separately 
estimate p t and condition on ot ■ Unfortunately, /combined is nonlinear for common models such as HMMs. 
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One common aspect of predictive state learning algorithms is that they exploit the covariance struc¬ 
ture between future and past observation sequences to obtain an unbiased observable state represen¬ 
tation. Boots and Gordon na note the connection between this covariance and (linear) instrumental 
regression in the context of the HSE-HMM. We use this connection to build a general framework for 
dynamical system learning where the state space can be identified using arbitrary (possibly nonlin¬ 
ear) supervised learning methods. This generalization lets us incorporate prior knowledge to learn 
compact or regularized models; our experiments demonstrate that this flexibility lets us take better 
advantage of limited data. 

Reducing the problem of learning dynamical systems with latent state to supervised learning bears 
similarity to Langford et al.’s sufficient posterior representation (SPR) ifTSI . which encodes the state 
by the sufficient statistics of the conditional distribution of the next observation and represents sys¬ 
tem dynamics by three vector-valued functions that are estimated using supervised learning ap¬ 
proaches. While SPR allows all of these functions to be non-linear, it involves a rather complicated 
training procedure involving multiple iterations of model refinement and model averaging, whereas 
our framework only requires solving three regression problems in sequence. In addition, the theo¬ 
retical analysis of na only establishes the consistency of SPR learning assuming that all regression 
steps are solved perfectly. Our work, on the other hand, establishes convergence rates based on the 
performance of SI regression. 


4 Theoretical Analysis 

In this section we present error bounds for two-stage instrumental regression. These bounds hold 
regardless of the particular S1 regression method used, assuming that the S1 predictions converge to 
the true conditional expectations. The bounds imply that our overall method is consistent. 

Let (xt,yt, Zt) G (X, y, Z) be i.i.d. triplets of input, output, and instrumental variables. (Lack of 
independence will result in slower convergence in proportion to the mixing time of our process.) Let 
Xt and y t denote E[xt | zt\ and E [y t \ Zt[. And, let Xt and y t denote E [xt \ zt] and E [y t \ zt] as 
estimated by the S1A and SIB regression steps. Here x t , Xt G X and y t . y t G y. 

We want to analyze the convergence of the output of S2 regression—that is, of the weights W given 
by ridge regression between SI A outputs and SIB outputs: 

W\= (y^y t ®x t 

\t=i 

Here <g) denotes tensor (outer) product, and A > 0 is a regularization parameter that ensures the 
invertibility of the estimated covariance. 

Before we state our main theorem we need to quantify the quality of S1 regression in a way that is 
independent of the S1 functional form. To do so, we place a bound on the S1 error, and assume that 
this bound converges to zero: given the definition below, for each fixed 6, limy-^ ryy ..y = 0. 

Definition 1 (S 1 Regression Bound). For any S > 0 and N G N + , the SI regression bound rjs ,N > 0 
is a number such that, with probability at least (1 — 5 /2), for all 1 < t < N: 

||*t - x t \\x < V6,N 
II m - yt\\y < vs,n 

In many applications, X, y and Z will be finite dimensional real vector spaces: W l ' ,: , R ,i ’ i and 
K"*. However, for generality we state our results in terms of arbitrary reproducing kernel Hilbert 
spaces. In this case S2 uses kernel ridge regression, leading to methods such as HSE-PSRs. Lor 
this purpose, let E 22 and E yy denote the (uncentered) covariance operators of x and y respectively: 
E 22 = E[x g> x], E yy = E [y (g) y]. And, let 1Z(Y, SS ) denote the closure of the range of E 22 . 

With the above assumptions. Theorem [2] gives a generic error bound on S2 regression in terms of 
S1 regression. If X and y are finite dimensional and S 22 has full rank, then using ordinary least 
squares (i.e., setting A = 0) will give the same bound, but with A in the first two terms replaced by 
the minimum eigenvalue of E 22 , and the last term dropped. 



Xt + A lx 


(3) 
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Theorem 2. Assume that ||a;||*,||a:||;y < c < oo almost surely. Assume W is a Hilbert-Sclimidt 
operator, and let W\ be as defined in 0- Then, with probability at least 1 — 8, for each cct es t £ 
TZ(T, SS ) s.t. latest ||* ^ 1- the error IIW^A^test - Wattestlly w bounded by 


f 

fi 

\ 

V5,N 

A +V At 


V 

l ) 

) 


V 

error in SI regression 


+ o 


( log(V^) /I 1\\ 

1 Viv u \i)J 

-V" - - ' 

error from finite samples 



error from regularization 


We defer the proof to the supplementary material. The supplementary material also provides explicit 
finite-sample bounds (including expressions for the constants hidden by O-notation), as well as 
concrete examples of SI regression bounds ; y for practical regression models. 

Theorem [ 2 ] assumes that attest is in 'R(T, XX '). For dynamical systems, all valid states satisfy this 
property. However, with finite data, estimation errors may cause the estimated state q t (i.e., attest) to 
have a non-zero component in 7£ J -(£ SS ). Lemmaplbounds the effect of such errors: it states that, in 
a stable system, this component gets smaller as Si regression performs better. The main limitation 
of Lemma [3] is the assumption that /filter is L-Lipchitz, which essentially means that the model’s 
estimated probability for o t is bounded below. There is no way to guarantee this property in practice; 
so, Lemma[3]provides suggestive evidence rather than a guarantee that our learned dynamical system 
will predict well. 

Lemma 3. For observations o\ : t, let q t be the estimated state given Let q t be the projection 

of qt onto 1 Z(Yi S x)- Assume /filter is L-Lipchitz on p t when evaluated at Ot, and /filter (pt, Ot) £ 
lZ(Y,xx) for any pt £ 7 Z(Eyy). Given the assumptions of theoremxAand assuming that ||/t||* < R 
for all 1 < t < T, the following holds for all 1 < t < T with probability at least 1 — 5/2. 

NU = ||g t -g t |U = o(^) 

Since W\ is bounded, the prediction error due to e t diminishes at the same rate as ||ef ||*. 


5 Experiments and Results 

We now demonstrate examples of tweaking the SI regression to gain advantage. In the first experi¬ 
ment we show that nonlinear regression can be used to reduce the number of parameters needed in 
SI, thereby improving statistical performance for learning an HMM. In the second experiment we 
show that we can encode prior knowledge as regularization. 

5.1 Learning A Knowledge Tracing Model 

In this experiment we attempt to model and predict the performance of students learning from an 
interactive computer-based tutor. We use the Bayesian knowledge tracing (BKT) model 02, which 
is essentially a 2-state HMM: the state s t represents whether a student has learned a knowledge 
component (KC), and the observation o t represents the success/failure of solving the t th question in 
a sequence of questions that cover this KC. Figure [3] summarizes the model. The events denoted by 
guessing, slipping, learning and forgetting typically have relatively low probabilities. 

5.1.1 Data Description 

We evaluate the model using the “Geometry Area (1996-97)” data available from DataShop l20l . 
This data was generated by students learning introductory geometry, and contains attempts by 59 
students in 12 knowledge components. As is typical for BKT, we consider a student’s attempt at a 
question to be correct iff the student entered the correct answer on the first try, without requesting 
any hints from the help system. Each training sequence consists of a sequence of first attempts for a 
student/KC pair. We discard sequences of length less than 5, resulting in a total of 325 sequences. 
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Current Observation Current State Next State 



Figure 3: Transitions and observations in BKT. Each node represents a possible value of the state 
or observation. Solid arrows represent transitions while dashed arrows represent observations. 


5.1.2 Models and Evaluation 

Under the (reasonable) assumption that the two states have distinct observation probabilities, this 
model is 1-observable. Hence we define the predictive state to be the expected next observation, 
which results in the following statistics: if) t = ot and = ot ®k Ot+i, where o t is represented by 
a 2 dimensional indicator vector and 0^ denotes the Kronecker product. Given these statistics, the 
extended state p t = E[£ t | oi ; t_i] is a joint probability table of o t:t + 1 - 

We compare three models that differ by history features and SI regression method: 

Spec-HMM: This baseline uses h t = Ot-i and linear SI regression, making it equivalent to the 
spectral HMM method of 0, as detailed in the supplementary material. 

Feat-HMM: This baseline represents h t by an indicator vector of the joint assignment of the previ¬ 
ous 6 observations (we set b to 4) and uses linear SI regression. This is essentially a feature-based 
spectral HMM |T2l . It thus incorporates more history information compared to Spec-HMM at the 
expense of increasing the number of SI parameters by 0(2 b ). 

LR-HMM: This model represents h t by a binary vector of length b encoding the previous b obser¬ 
vations and uses logistic regression as the S1 model. Thus, it uses the same history information as 
Feat-HMM but reduces the number of parameters to 0(6) at the expense of inductive bias. 

We evaluated the above models using 1000 random splits of the 325 sequences into 200 training 
and 125 testing. For each testing observation ot we compute the absolute error between actual and 
expected value (i.e. |6 0t=1 — P(o t = 1 I o 1: t_i)|). We report the mean absolute error for each split. 
The results are displayed in Figure |4| 6 7 | We see that, while incorporating more history information 
increases accuracy (Feat-HMM vs. Spec-HMM), being able to incorporate the same information 
using a more compact model gives an additional gain in accuracy (FR-HMM vs. Feat-HMM). We 
also compared the FR-HMM method to an HMM trained using expectation maximization (EM). We 
found that the FR-HMM model is much faster to train than EM while being on par with it in terms 
of prediction errorj^] 

5.2 Modeling Independent Subsystems Using Lasso Regression 

Spectral algorithms for Kalman filters typically use the left singular vectors of the covariance be¬ 
tween history and future features as a basis for the state space. However, this basis hides any sparsity 
that might be present in our original basis. In this experiment, we show that we can instead use 
lasso (without dimensionality reduction) as our SI regression algorithm to discover sparsity. This is 
useful, for example, when the system consists of multiple independent subsystems, each of which 
affects a subset of the observation coordinates. 

To test this idea we generate a sequence of 30-dimensional observations from a Kalman filter. Obser¬ 
vation dimensions 1 through 10 and 11 through 20 are generated from two independent subsystems 
of state dimension 5. Dimensions 21-30 are generated from white noise. Each subsystem’s transi¬ 
tion and observation matrices have random Gaussian coordinates, with the transition matrix scaled 


6 The differences have similar sign but smaller magnitude if we use RMSE instead of MAE. 

7 We used MATLAB’s built-in logistic regression and EM functions. 
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Spec-HMM 


Spec-HMM 


Feat-HMM 


Model 

Spec-HMM 

Feat-HMM 

LR-HMM 

EM 

Training time (relative to Spec-HMM) 

1 

1.02 

2.219 

14.323 


Figure 4: Experimental results: each graph compares the performance of two models (measured 
by mean absolute error) on 1000 train/test splits. The black line is x = y. Points below this line 
indicate that model y is better than model x. The table shows training time. 



Figure 5: Left singular vectors of (left) true linear predictor from ot -1 to Ot (i.e. OTO + ), (middle) 
covariance matrix between o t and o t -\ and (right) SI Sparse regression weights. Each column 
corresponds to a singular vector (only absolute values are depicted). Singular vectors are ordered by 
their mean coordinate, interpreting absolute values as a probability distribution over coordinates. 


to have a maximum eigenvalue of 0.95. States and observations are perturbed by Gaussian noise 
with covariance of 0.01/ and 1.0 1 respectively. 

We estimate the state space basis using 1000 examples (assuming 1-observability) and compare the 
singular vectors of the past to future regression matrix to those obtained from the Lasso regression 
matrix. The result is shown in figure [5] Clearly, using Lasso as stage 1 regression results in a basis 
that better matches the structure of the underlying system. 

6 Conclusion 

In this work we developed a general framework for dynamical system learning using supervised 
learning methods. The framework relies on two key principles: first, we extend the idea of predictive 
state to include extended state as well, allowing us to represent all of inference in terms of predictions 
of observable features. Second, we use past features as instruments in an instrumental regression, 
denoising state estimates that then serve as training examples to estimate system dynamics. 

We have shown that this framework encompasses and provides a unified view of some previous 
successful dynamical system learning algorithms. We have also demostrated that it can be used 
to extend existing algorithms to incorporate nonlinearity and regularizers, resulting in better state 
estimates. As future work, we would like to apply this framework to leverage additional techniques 
such as manifold embedding and transfer learning in stage 1 regression. We would also like to 
extend the framework to controlled processes. 
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A Spectral and HSE Dynamical System Learning as Regression 

In this section we provide examples of mapping some of the successful dynamical system learning 
algorithms to our framework. 

A.l HMM 

In this section we show that we can use instrumental regression framework to reproduce the spec¬ 
tral learning algorithm for learning HMM We consider 1-observable models but the argument 
applies to fc-observable models. In this case we use ip t = e 0t and = e 0t . t+1 = e 0t (&k e Qt+1 , 

where denotes the kronecker product. Let Pij = E[e 0i ® e D .] be the joint probability table of 

observations i and j and let Pjj be its estimate from the data. We start with the (very restrictive) 
case where Pi 2 is invertible. Given samples of /12 = e Ql , ip 2 = e D2 and £2 = e 02 . 3 , in SI regression 
we apply linear regression to learn two matrices W 2 j and W 2 : 3 ,i such that: 

IE ['02 | ft-2] = E 020l E 0l 1 /l2 = p2 } lP 1 lht = W2,\h2 (A.l) 

IE[^2 I ^- 2 ] = ^' 02 : 30 iS 0 l 1 h ,2 = A:3,l-Pl,1^2 = TA / 2:3,l^ i 2, (A.2) 


where P 2:3 ,i = E[e 02;3 <g» e Gl ] 

In S2 regression, we learn the matrix W that gives the least squares solution to the system of equa¬ 
tions 

m 2 \h 2 ] = ^ 2 : 3,1601 = W(W 2t ie 0l ) = WE[ip 2 \h 2 ] , for given samples of h 2 
which gives 

W = W2..3, Ae 0l el]Wj A (w 2 ,it[e 0 l e T 0 l ]WZi )"' 

= (ArS.lATlATl) (A.iATi^Ti) 1 

= -A:3,l (A,l^ (A.3) 


Having learned the matrix W, we can estimate 


Pt = Wq t 


starting from a state q t . Since p f specifies a joint distribution over e 0t+1 and e 0t we can easily 
condition on (or marginalize o t ) to obtain qt+i- We will show that this is equivalent to learning and 
applying observable operators as in 0 : 

For a given value x of 02 , define 

B x =u T x W = uj P 2 , 3,1 (ATi) _1 . (A.4) 

where u x is an \0\ x \0\' 2 matrix which selects a block of rows in P 2 : 3 ,i corresponding to 02 = x. 
Specifically, u x = S x ® fc 


qt+ 1 = E[e 0 t+ 1 |oi :t ] oc ult[e 0t:t+1 \o 1:t _i\ 

= uJ t E[^ t |oi :t _i] = uJ t WE [ijj t \oi-t-i] = B 0t q t 

with a normalization constant given by 


1 

1 T B at q t 

8 Following the notation used in ®,ujh:3,l=h,x,l 


(A.5) 
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Now we move to a more realistic setting, where we have rank(P 2 ,i) = m < \0\. Therefore we 
project the predictive state using a matrix U that preserves the dynamics, by requiring that U t O 
(i.e. U is an independent set of columns spanning the range of the HMM observation matrix ()). 

It can be shown J3] that TZ(O) = 1Z(P 2 +) = P(P 2 ! iP 1 _ 1 1 ). Therefore, we can use the leading m left 

singular vectors of W 2 + , which corresponds to replacing the linear regression in S1A with a reduced 
rank regression. However, for the sake of our discussion we will use the singular vectors of P 2 +. In 
more detail, let [U, S, V] be the rank-//; SVD decomposition of P 2 ,i- We use ip t = U T e 0t and = 
e 0t U T e 0t+1 . SI weights are then given by WJ.j = U t W 2 ,i and = (I\ 0 \ ® k U T )W 2:3 + 

and S2 weights are given by 

W rr = (. I\ 0 \ ®k U T )W 2 ,^ 1 t[e 0l e T 0l \Wl- l U (u T W^eo^W^u) 

= (i\o\ u T )P 2:3tl P^vs (sv-rPi'vsy 1 

= ( I\o\ ®k U T )P 2 .. 3A P^V (v T P^ iV) _1 5 - 1 (A. 6 ) 

In the limit of inhnite data, V spans range(O) = rowspace(P 2 : 3 ) i) and hence P 2:3t i = P 2:3 } iVV t . 
Substituting in ( |A. 6 | > gives 

w rr = (J|C| ® k U T )P 2:3 +VS ~ 1 = (I|0| P T )P 2:3 ,1 (U t P 2 ,i) + 

Similar to the full-rank case we define, for each observation x an m x \0\ 2 selector matrix u x = 
5 X 0 k Im and an observation operator 

B x = ulW rr ^U T P 3>Xtl (U T P 2>1 ) + (A.7) 


with P 3jX ,i and P 2 i replaced by their empirical estimates. 

Note that for a state b t = E[p t \o 1:t -i], B x b t = P(o t |o 1 :t _ 1 )E[^ t +i|o 1;t ] = P{o t |oi :t _i) 6 t+ i. To 
get bt+ 1 , the normalization constant becomes p( Qt |o lt t ) = k T b b t ' w h ere b^b = 1 for any valid 
predictive state b. To estimate boo we solve the aforementioned condition for states estimated from 
all possible values of history features h t . This gives, 

blW^I\o\ = blU T P 2A Prp l0 \ = lJo\, 

where the columns of I\q\ represent all possible values of h t . This in turn gives 

bio = lio|A, 1 (tf T A, 1 ) + 

= Pl(u T P 2 , 1 ) + 1 

the same estimator proposed in 0 . 


This is exactly the observation operator obtained in yi|. However, instead of using|A. 6 j they use A.7 


A.2 Stationary Kalman Filter 

A Kalman filter is given by 


s t = Ost-i + v t 
Ot = Tst + €t 
Vt ~ AA(0, E 8 ) 


We consider the case of a stationary filter where £ t = E[s t Sj ] is independent of t. We choose our 
statistics 

ht = Ot-H-.t -1 
Pt = Of.t+F -1 
6 = Of.t+F, 
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Where a window of observations is represented by stacking individual observations into a single 
vector. It can be shown fraiSl that 

and it follows that 

Efokl h t ) = TE s ^\h t = Wyht 
e[6 N = r+s^s -\h t = W 2 h t 


where l is the extended observation operator 

/ ° \ 

OT 


r = 


> r + = 


V ot f ) 


o 

OT 


\ OT F+l ) 


It follows that F and H must be large enough to have rank(VF) = n. Let U £ be the matrix 

of left singular values of W\ corresponding to non-zero singular values. Then U 1 I’ is invertible and 
we can write 

Efo^N = UU T TY, s , h Y^ h ht = Wxht 

£[6N = r+E Sih E ~lh t = W 2 h t 

E[6M =r+([/ T r)- 1 [/ T (uu T TE s ^ h ht) 

= WE.[ip t \h t ] 


which matches the instrumental regression framework. For the steady-state case (constant Kalman 
gain), one can estimate Ec given the data and the parameter W by solving Riccati equation as 
described in f8). E[£t|oi : t_i] and E^ then specify a joint Gaussian distribution over the next F + 1 
observations where marginalization and conditioning can be easily performed. 

We can also assume a Kalman filter that is not in the steady state (i.e. the Kalman gain is not 
constant). In this case we need to maintain sufficient statistics for a predictive Gaussian distribution 
(i.e. mean and covariance). Let vec denote the vectorization operation, which stacks the columns of 
a matrix into a single vector. We can stack ht and vec (h t hj ) to into a single vector that we refer to 
as lst+2nd moments vector. We do the same for future and extended future. We can, in principle, 
perform linear regression on these lst+2nd moment vectors but that requires an unnecessarily large 
number of parameters. Instead, we can learn an SI A regression function of the form 

E[iPt\h t ] = Wtht (A.8) 

E [iPtipJ | h] = WththJ Wi + R (A.9) 

(A. 10) 

Where R is simply the covariance of the residuals of the 1st moment regression (i.e. covariance of 
ft = tf’t ~ E[ip t \ht])- This is still a linear model in terms of lst+2nd moment vectors and hence 
we can do the same for SIB and S2 regression models. This way, the extended belief vector p t (the 
expectation of lst+2nd moments of extended future) fully specifies a joint distribution over the next 
F + 1 observations. 

A.3 HSE-PSR 

We define a class of non-parametric two-stage instrumental regression models. By using conditional 
mean embedding |[2~l1 as SI regression model, we recover a single-action variant of HSE-PSR (9). 
Let A' .y,Z denote three reproducing kernel Hilbert spaces with reproducing kernels kx,ky and 
kz respectively. Assume i) t £ T and that £ y is defined as the tuple (o* (g> Ot, V’t+i <8* ot)- Let 
£ A’®K JV , S £ and H £ be operators that represent training data. Specifically, 
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ip s , £ s , h s are the s th ’’columns” in SI/ and S and H respectively. It is possible to implement SI 
using a non-parametric regression method that takes the form of a linear smoother. In such case the 
training data for S2 regression take the form 

N 

^bPt I h t ) = ^2/3 a \h t i(>8 

s =1 
N 

E[6 I ht) =^2js\hits, 

8=1 

where (3 S and 7 S depend on h t . This produces the following training operators for S2 regression: 

= VB 

where B st = fi s \k, and T st = With this data, S2 regression uses a Gram matrix formulation 

to estimate the operator 

W = ET{B t G x , x B + \I n )- 1 B t 'S’* (A.l 1) 

Note that we can use an arbitrary method to estimate B. Using conditional mean maps, the weight 
matrix B is computed using kernel ridge regression 

B = (Gz,z + A/jv) 1 Gz,z (A.12) 

HSE-PSR learning is similar to this setting, with ip t being a conditional expectation operator of test 
observations given test actions. For this reason, kernel ridge regression is replaced by application of 
kernel Bayes rule IflOl . 

For each t, SI regression will produce a denoised prediction E[£ t \ h t ] as a linear combination of 
training feature maps 

N 

E[£t | ht] = £1 Ut = 5 ' dt,s£s 
s=l 


This corresponds to the covariance operators 

N 

£. 0 t+lOt |ht = 51 a t,si>s +1 ® 0 S = , ® ,, diag(a t ) 0 * 

S — 1 

N 

Ko t \h t = 5 Z Qi . s ° 3 ®°s = 0 diag(a t ) 0 * 

S—1 

Where, SP' is the shifted future training operator satisfying SF'et = ipt+i Given these two covariance 
operators, we can use kernel Bayes rule ifTOl to condition on o t which gives 

<Zt+l = E[lpt+1 | ht] = ^T/, t+1 o t \h t (Eo t o t \h t + A/) lQ t- (A.13) 

Replacing o* in ( |A. 1 3[ > with its conditional expectation a sO s corresponds to marginalizing 

over ot (i.e. prediction). A stable Gram matrix formulation for ( |A. 1 3| ) is given by flOl 

Qt+ 1 

= ^''diag(at)Go,o((diag(a t )G 0 ,o) 2 + A TV/) -1 

.diag(a t )O* 0 t+ i 

= (A. 14) 

which is the state update equation in HSE-PSR. Given dt+i we perform S2 regression to estimate 

Pt+1 = E[6+1 I 0i:t+i] = 3at+i = 

where W is defined in (|A.11|). 
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B Proofs 


B.l Proof of Main Theorem 


In this section we provide a proof for theorem [2] We provide finite sample analysis of the effects 
of SI regression, covariance estimation and regularization. The asymptotic statement becomes a 
natural consequence. 


We will make use of matrix Bernstein’s inequality stated below: 

Lemma B.l (Matrix Bernstein’s Inequality ll22T0 . Let Abe a random square symmetric matrix, and 
r > 0, v > 0 and k > 0 be such that, almost surely, 

E[A] =0, A max [A] < r, 

A max [E[A 2 ]] < v, tr(E[A 2 ]) < k. 


If A]_,A 2 ,..., An are independent copies of A, then for any t > 0. 

N 


Pr 


An 


1 


]fl> 


t=l 


> 
kt, 


2vt rt 
N~ + 3N 

\-i 


< —(e* — t — 1) . 

v 


Ift > 2.6, then f(e* — t — 1) 1 < e */ 2 . 


(B.l) 


Recall that, assuming Xt es t G 7?.(E 2£ ), we have three sources of error: first, the error in SI re¬ 
gression causes the input to S2 regression procedure (x t , yt ) to be a perturbed version of the true 
{%t,yt)\ second, the covariance operators are estimated from a finite sample of size TV; and third, 
there is the effect of regularization. In the proof, we characterize the effect of each source of error. 
To do so, we define the following intermediate quantities: 

Wx = Eg* + A/)” 1 (B.2) 

W x = £ m (£^ + \l)~\ (B.3) 

where 

1 N _ 

£yx = ^2 y* ® Xt 

v t=l 


and is defined similarly. Basically, W\ captures only the effect of regularization and W\ cap¬ 
tures in addition the effect of finite sample estimate of the covariance. W\ is the result of S2 
regression if x and y were perfectly recovered by SI regression. It is important to note that E l:l/ and 
are not observable quantities since they depend on the true expectations x and y. We will use 
A.,.,; and A, ;i to denote the i th eigenvalue of E r;;: and E ?/y respectively in descending order and we 
will use ||.|| to denote the operator norm. 


Before we prove the main theorem, we define the quantities Q X N and (f v N which we use to bound 
the effect of covariance estimation from finite data, as stated in the following lemma: 

Lemma B.2 (Covariance error bound). Let N be a positive integer and S € (0,1) and assume that 
||x||,||y||<c<oo almost surely. Let C% V N be defined as: 



where 


(B.4) 


t = max(2.6, 2 log(4/c/<5zz)) 
r = c 2 + ||Ey$|| 

v = c 2 max(A y i, A^i) + ||E S y|| 2 
k = c 2 (tr(E s$ ) + tr(Ej w )) 
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In addition, let Q X N be defined as: 


where 


/-XX _ 

— 



r't' 
+ 31V’ 


t' = max(2.6, 2 log(4 k'/5v')) 
f' = c 2 + A x i 
v' = c 2 X x i + X xl 
k' = c 2 tr(£ 22 ) 

and define C'/'% similarly for S yy. 

It follows that, with probability at least 1 — 5/2, 

II ^yx — Eg 2 || < 

||E*s - || < Cfjv 

II Ej/y — Egg || < 


(B.5) 


Proof. We show that each statement holds with probability at least 1 — 5/6. The claim then follows 
directly from the union bound. We start with Q X N . By setting A t = Xt ® Xt — £ 22 then we would 

like to obtain a high probability bound on || X^t Lemma B.l shows that, in order to satisfy 


the bound with probability at least 1 — 5/6, it suffices to set t to max(2.6, 2fc log(6/5t;)). So, it 
remains to find suitable values for r, v and k: 


Amaxb] < ||at|| 2 + II Eg® || < C 2 + A 2 i — r' 

Amax[E[v4“]| = A max pE[||tc|| (x 0 x) (x® at)£ 22 E 22 (at g) x') T" £ 22 ] 

= A m ax[E[||x|| 2 (x g x) - £ 22 2 ]] < c 2 X xl + A 2 ! = v’ 
tr[E[y4 2 ]] = tr[E[||S|| 2 (i g x) - £ 22 2 ]] < tr[E[||5|| 2 (S g x)]] < c 2 tr(£ 22 ) = k’ 


The case of Q V N can be proven similarly. Now moving to Q V N , we have B t = yt®Xt — £g 2 . Since 
B t is not square, we use the Hermitian dilation J^f(B) defined as follows l23l : 


Note that 


A = Jf(B) 


0 B 
B* 0 


A max [A] = ||B||, A 2 


BB* 0 
0 B*B 


therefore suffices to bound || -L JJtLi || using an argument similar to that used in <ff x N case. O 


To prove theorem [2] we write 

||Xtest - Wstertlly < ||(Wa - W A )Stestb 
+ ||(^A- WA^testb 

+ \\{W X - VLbtestlly (B.6) 

We will now present bounds on each term. We consider the case where attest £ 7?.(E 22 ). Extension 
to 1Z(Y, XX ) is a result of the assumed boundedness of IT, which implies the boundedness of W\ — W. 
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Lemma B.3 (Error due to SI Regression). Assume that ||x||, ||y|| < c < oo almost surely, and let 
r]s,N be as defined in Definition [7J The following holds with probability at least 1 — <5 


\\W x -W x \\<JX yl + C V s V N 


(2 cris,N + Vs,n 2 ) 


\ 3 

A 2 


+ 


(2 C7^jy + rjs,N 2 ) 

A 


= O ( rj s ,N | ^ + 


0^1og(l/5) \ ' 


An 


A* 




The asymptotic statement assumes ps,N —> 0 as N —> oo. 


Proof. Write Egg = Egg + A x and Egg = E yyX + A yx . We know that, with probability at least 
1 — 8/2, the following is satisfied for all unit vectors <f> x £ X and fi y £ y 

1 N 

((f>y, A yx (f>x)y = y ] {fiy: ytjyifix, tit)X 

V t=l 

— {^yi Vt)y{4 > xi %t)X 

T { < i , yiyt)y( , t > xitrt)x ~ i't’yi Vt) y (tyxi x t) x 
= + (Vt - yt))y(<t>x,Xt-x t )x 

t 

t {fiyi i/t — yt/yifixi Xt) x 

< 2ctis,n + r]g N 

Therefore, 

|A ya; || — SUp ((f) yi A yx (j) x )y f 2ct]g,N T" Vs.Ni 

\\ 4 >A\x< 1 ,\\ 4 >y\\y<t 


and similarly 

|||| < 2cpg,N + ys,N 2 , 

with probability 1 — 8/2. We can write 

W x - W x = Eg* ((Ex* + A, + XI)- 1 - (Egg + A/)- 1 ) 

+ Ag^Egg + A x + XI) 1 

Using the fact that B _1 — A~ x = B~ 1 {A — B)A~ l for invertible operators A and B we get 

W\ — W\ = — Egg(Egg + XI) 1 A X (E XX + A x + XI) 1 
+ A ya; (Egg + A x + XI) 1 

^ ^ 1 A 1 

we then use the decomposition Egg = EggUEJg, where V is a correlation operator satisfying 
|| U|| < 1. This gives 

W x -W x = 

— ^SyV^ixi^xx + XI) 2 (Egg + A I) 2 A x (Egg + A x + XI) 1 
+ A ya; (Egg + A x + XI) 1 

~ a » _ i 

Noting that ||EJ $ (Egg + XI) 2 1| <1, the rest of the proof follows from triangular inequality and 
the fact that || AB|| < P||||B|| □ 
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Lemma B.4 (Error due to Covariance). Assuming that ||ar||ar)||j/||y < c < oo almost surely, the 
following holds with probability at least 1 — f 


\\W x -Wx\\< 


, where (ff x N and (fjt v N are as defined in Lemma 


B.2. 


r x v 

^5,N 


Proof. Write E $$ = E $$ + A x and E yx = T, m + A yx . Then we get 

W\ — W\ = E y X ((S^x + tA x + A I) 1 — (Egg + A I) 1 ) + A yx (S xx + A x + XI) 

Using the fact that f? _1 — A -1 = B _1 {A — B)A~ l for invertible operators A and B we get 

W\ — W\ = — Aiy X (Yi xx + XI) 1 A x (X2 xx + A x + XI) 1 + A yx (Yl xx + + XI) 

1 1 

we then use the decomposition Ti yx = E yy 2 VT, XX 2 , where V is a correlation operator satisfying 
||U|| < 1. This gives 

Wx-W x = 

~ ^-“yy 5 V^xx 5 (Sax + XI) 2 (Eax + XI) 2 
■A x (T, xx + A x + XI) 1 
+ ‘A.yxi'^xx + *A. X + XI) 1 

Noting that ||Eg $ 2 (E xx + XI)~i || < 1, the rest of the proof follows from triangular inequality and 
the fact that j| AB|| < m||||B|l □ 

Lemma B.5 (Error due to Regularization on inputs within IZ(T, XX )). For any x £ TZ(T, XX ) s.t. 
||x||at < 1 and ||S $ g _2 a;||^ < C. The following holds 

\\{W x -W)x\\y< l -VX\\W\\ HS C 


Proof Since x £ 1 Z(Tj xx ) C IZ(T, xx 2 ), we can write x = E $s 2 v for some v £ X s.t. ||u|| x < C. 

Then 

{W\ — W)x = Sy£.((Eax + XI) 1 — Yi X x l )^xx^ v 


Let D = T,y X ((A, xx + XI) 1 — E xx 1 )S ffi g 2 . We will bound the Hilbert-Schmidt norm of D. 
Let ip xi £ X , tpyi £ y denote the eigenvector corresponding to X :r/I and X yl respectively. Define 

Sij = \{lpyj, Zxyi>xi)y\- Then we have 


\(t/)yj, Dlpxi)y\ 


(ll)yj,'Ey X - 1p xi ) 

\Axi i A)y A x i y 

XSij Sij 1 

(A xi T A)-\/A xi s/Xfi \ xi "t“ 1 


< 



1 

2 



1 

2 


i Wll>xi)y | > 


where the inequality follows from the arithmetic-geometric-harmonic mean inequality. This gives 
the following bound 

\\D\\hs = E ^vP D ^ify < I^WWhs 
ij 
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and hence 


||(W A - W)x\\ y < ||£>|||M|* < II^IMMk 

< \^\\W\\hsC 


□ 


Note that the additional assumption that || T, xx 2 x\\x < C is not required to obtain an asymptotic 
0(V A) rate for a given x. This assumption, however, allows us to uniformly bound the constant. 
Theorem[2]is simply the result of plugging the bounds in Lemmata B.3 B.4| and B.5 into ( | B.6 [ > and 
using the union bound. 

B.2 Proof of Lemma|3] 

for t = 1: Let I be an index set over training instances such that 


Then 


\\QT st Qrik = ^ E WQi QiWx %EIIQi- QiWx < 


\z\ 


iex 1 1 tel 

for t > 1: Let A denote a projection operator on "RA (E yy ) 

II QlTi - QV&Wx < L \\ p t est - P? st \\y < L\\AW x QT st \\y 

y -1 


< L 


< L 


1 

N 


N 


Y, APi (g) Q» ) ( ^ E & ® & + A/ 


N 


Kl= 1 

N 


N 


Y APi <g> AP\ 


i= 1 


AT 




« 


test 


A* 


AT ^ 


< L 


Vx 


II Q\ 


\x, 


where the second to last inequality follows from the decomposition similar to Eyx = E f V E ^, 
and the last inequality follows from the fact that \\AP t \\y<\\Pi-Pi\\y. □ 


C Examples of SI Regression Bounds 

The following propositions provide concrete examples of SI regression bounds r}g,N for practical 
regression models. 

Proposition C.l. Assume X = W 1 "'-, W iy . R rf -- for some d x ,d y ,d z < oo and that x and y are linear 
vector functions of z where the parameters are estimated using ordinary least squares. Assume that 
||x|| x, || j/||;y < c < oo almost surely. Let ip y N be as defined in Definition |7j Then 

V6,n = 0 ( \/4 log ((dx + dy)/6)) 


Proof, (sketch) This is based on results that bound parameter estimation error in linear regression 
with univariate response (e.g. El). Note that if x tl = Uj Zt for some Ui £ Z , then a bound on 
the error norm \\Ui — U-, \| implies a uniform bound of the same rate on .f, — x. The probability of 
exceeding the bound is scaled by l/(d x + d y ) to correct for multiple regressions. □ 

Variants of Proposition |C. 1 1 can also be developed using bounds on non-linear regression models 
(e.g., generalized linear models). 

The next proposition addresses a scenario where X and y are infinite dimensional. 
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Proposition C.2. Assume that x and y are kernel evaluation functionals, x and y are linear vector 
functions of z where the linear operator is estimated using conditional mean embedding with 
regularization parameter Ao > 0 and that ||x||^, ||y||y < c < oo almost surely. Let rjs.N be as 
defined in Definition [7] It follows that 


n ( /y Ilog(N/6) \ 

vs ’ N = oy /Xo + \I^^J 


Proof (sketch) This bound is based on ED, which gives a bound on the error in estimating the 
conditional mean embedding. The error probability is adjusted by S/4N to accommodate the re¬ 
quirement that the bound holds for all training data. □ 
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